We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/.
We present a unified and compact representation for object rendering, 3D reconstruction, and grasp pose prediction that can be inferred from a single image within a few seconds. We achieve this by leveraging recent advances in the Neural Radiance Field (NeRF) literature that learn category-level priors and fine-tune on novel objects with minimal data and time. Our insight is that we can learn a compact shape representation and extract meaningful additional information from it, such as grasping poses. We believe this to be the first work to retrieve grasping poses directly from a NeRF-based representation using a single viewpoint (RGB-only), rather than going through a secondary network and/or representation. When compared to prior art, our method is two to three orders of magnitude smaller while achieving comparable performance at view reconstruction and grasping. Accompanying our method, we also propose a new dataset of rendered shoes for training a sim-2-real NeRF method with grasping poses for different widths of grippers.
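Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To reduce that effort, large language models (LLMs) can be used to score potential next actions during task planning, or even to generate action sequences directly, given a natural language instruction and no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions that are not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation that works across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the actions and objects available in an environment, together with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state-of-the-art success rates on VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks. Website: progprompt.github.io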
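A minimal sketch of what a program-like prompt of this kind could look like, assuming a Python-style layout. The helper `build_prompt`, the action and object names, and the example program are all hypothetical illustrations and are not taken from the paper's code.

```python
def build_prompt(actions, objects, example_programs, task):
    """Assemble a program-like prompt (hypothetical layout).

    The prompt lists the available actions as an import line, the available
    objects as a list literal, a few executable example programs, and finally
    the header of a new function whose body the LLM is asked to complete.
    """
    lines = ["from actions import " + ", ".join(actions), ""]
    lines.append("objects = " + repr(list(objects)))
    lines.append("")
    for example in example_programs:
        lines.append(example.strip())
        lines.append("")
    # The task name becomes the function header; the body is left open.
    lines.append("def " + task.replace(" ", "_") + "():")
    return "\n".join(lines)


# Hypothetical example program shown to the model.
EXAMPLE = '''
def throw_away_banana():
    # 1: pick up the banana
    grab('banana')
    # 2: put it in the bin
    putin('banana', 'garbagecan')
'''

prompt = build_prompt(
    actions=["grab", "putin", "open", "close"],
    objects=["banana", "garbagecan", "fridge"],
    example_programs=[EXAMPLE],
    task="put banana in fridge",
)
print(prompt)
```

Because the prompt only mentions actions and objects that exist in the environment, a completion that reuses those names is more likely to be executable than free-form text would be.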
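Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning a neural approximation on a lookup from trainable feature grids, which take on part of the learning task and allow for smaller, more efficient neural networks. Unfortunately, these feature grids usually come at the cost of significantly increased memory consumption compared to stand-alone neural network models. We present a dictionary method for compressing such feature grids, reducing their memory consumption by up to 100x and permitting a multiresolution representation that can be useful for out-of-core streaming. We formulate the dictionary optimization as a vector-quantized auto-decoder problem, which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available and with dynamic topology and structure. Our source code will be available at https://github.com/nv-tlabs/vqad.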
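The memory saving from a dictionary (codebook) representation can be illustrated with a short sketch. This is a simplified illustration, not the paper's method: each grid cell stores a small integer index into a shared codebook instead of a full feature vector. The function names and all sizes below are hypothetical and chosen only for the arithmetic.

```python
import math


def decode(indices, codebook):
    """Rebuild a feature grid by looking up each cell's code in the shared dictionary."""
    return [codebook[i] for i in indices]


def compression_ratio(num_cells, feature_dim, codebook_size, float_bits=32):
    """Dense grid size divided by (codebook + per-cell index) size, both in bits."""
    dense_bits = num_cells * feature_dim * float_bits
    index_bits = math.ceil(math.log2(codebook_size))
    compressed_bits = codebook_size * feature_dim * float_bits + num_cells * index_bits
    return dense_bits / compressed_bits


# Tiny grid: three cells sharing a two-entry codebook of 2-D features.
grid = decode([1, 0, 1], [[0.0, 0.0], [1.0, 2.0]])

# Hypothetical sizes (not from the paper): one million cells, 16-D features, 64 codes.
ratio = compression_ratio(num_cells=1_000_000, feature_dim=16, codebook_size=64)
print(grid, round(ratio, 1))
```

The ratio grows with the feature dimension and shrinks as the codebook gets larger, since each cell then needs more index bits.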
We present a new dataset for 6-DoF pose estimation of known objects, with a focus on robotic manipulation research. We propose a set of toy grocery objects, whose physical instantiations are readily available for purchase and are appropriately sized for robotic grasping and manipulation. We provide 3D scanned textured models of these objects, suitable for generating synthetic training data, as well as RGBD images of the objects in challenging, cluttered scenes exhibiting partial occlusion, extreme lighting variations, multiple instances per image, and a large variety of poses. Using semi-automated RGBD-to-model texture correspondences, the images are annotated with ground truth poses accurate within a few millimeters. We also propose a new pose evaluation metric called ADD-H based on the Hungarian assignment algorithm that is robust to symmetries in object geometry without requiring their explicit enumeration. We share pre-trained pose estimators for all the toy grocery objects, along with their baseline performance on both validation and test sets. We offer this dataset to the community to help connect the efforts of computer vision researchers with the needs of roboticists.
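To illustrate why an assignment-based metric is robust to symmetries, the sketch below compares a plain index-wise average distance with an average distance under the best one-to-one point assignment. It is an assumption-laden toy: the optimal assignment is found by brute force over permutations rather than by the Hungarian algorithm (which yields the same optimum far more efficiently), the function names are hypothetical, and the exact definition of ADD-H in the paper may differ.

```python
import math
from itertools import permutations


def add_indexwise(pred_points, gt_points):
    """Mean distance between points that share the same index."""
    return sum(math.dist(p, g) for p, g in zip(pred_points, gt_points)) / len(gt_points)


def add_assignment_bruteforce(pred_points, gt_points):
    """Mean distance under the best one-to-one assignment (brute force, tiny inputs only)."""
    n = len(pred_points)
    best_total = min(
        sum(math.dist(pred_points[i], gt_points[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n


# Four corners of a square, and the same square rotated by 90 degrees about z.
gt = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]
pred = [(0, 1, 0), (-1, 0, 0), (0, -1, 0), (1, 0, 0)]

print(add_indexwise(pred, gt))              # penalizes the symmetric rotation
print(add_assignment_bruteforce(pred, gt))  # treats the two poses as equivalent
```

The rotated square occupies exactly the same points, so the assignment-based score is zero without the symmetry ever being listed explicitly.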
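Rendering articulated objects while controlling their poses is critical to applications such as virtual reality or animation for movies. Manipulating the pose of an object, however, requires an understanding of its underlying structure, that is, its joints and how they interact with each other. Unfortunately, assuming the structure to be known, as existing methods do, precludes the ability to work on new object categories. We propose to learn both the appearance and the structure of previously unseen articulated objects by observing them move from multiple views, with no additional supervision such as joint annotations or information about the structure. Our insight is that adjacent parts that move relative to each other must be connected by a joint. To leverage this observation, we model the object parts in 3D as ellipsoids, which allows us to identify joints. We combine this explicit representation with an implicit one that compensates for the approximation introduced. We show that our method works for different structures, from quadrupeds, to single-arm robots, to humans.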
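Unsupervised generation of high-quality, multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits the quality and resolution of the generated images, and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. For this purpose, we introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, not only synthesizes high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.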
Accurate determination of a small molecule candidate (ligand) binding pose in its target protein pocket is important for computer-aided drug discovery. Typical rigid-body docking methods ignore the pocket flexibility of protein, while the more accurate pose generation using molecular dynamics is hindered by slow protein dynamics. We develop a tiered tensor transform (3T) algorithm to rapidly generate diverse protein-ligand complex conformations for both pose and affinity estimation in drug screening, requiring neither machine learning training nor lengthy dynamics computation, while maintaining both coarse-grain-like coordinated protein dynamics and atomistic-level details of the complex pocket. The 3T conformation structures we generate are closer to experimental co-crystal structures than those generated by docking software, and more importantly achieve significantly higher accuracy in active ligand classification than traditional ensemble docking using hundreds of experimental protein conformations. 3T structure transformation is decoupled from the system physics, making future usage in other computational scientific domains possible.
Adversarial imitation learning (AIL) has become a popular alternative to supervised imitation learning that reduces the distribution shift suffered by the latter. However, AIL requires effective exploration during an online reinforcement learning phase. In this work, we show that the standard, naive approach to exploration can manifest as a suboptimal local maximum if a policy learned with AIL sufficiently matches the expert distribution without fully learning the desired task. This can be particularly catastrophic for manipulation tasks, where the difference between an expert and a non-expert state-action pair is often subtle. We present Learning from Guided Play (LfGP), a framework in which we leverage expert demonstrations of multiple exploratory, auxiliary tasks in addition to a main task. The addition of these auxiliary tasks forces the agent to explore states and actions that standard AIL may learn to ignore. Additionally, this particular formulation allows for the reusability of expert data between main tasks. Our experimental results in a challenging multitask robotic manipulation domain indicate that LfGP significantly outperforms both AIL and behaviour cloning, while also being more expert sample efficient than these baselines. To explain this performance gap, we provide further analysis of a toy problem that highlights the coupling between a local maximum and poor exploration, and also visualize the differences between the learned models from AIL and LfGP.
Many problems in machine learning involve bilevel optimization (BLO), including hyperparameter optimization, meta-learning, and dataset distillation. Bilevel problems consist of two nested sub-problems, called the outer and inner problems, respectively. In practice, often at least one of these sub-problems is overparameterized. In this case, there are many ways to choose among optima that achieve equivalent objective values. Inspired by recent studies of the implicit bias induced by optimization algorithms in single-level optimization, we investigate the implicit bias of gradient-based algorithms for bilevel optimization. We delineate two standard BLO methods -- cold-start and warm-start -- and show that the converged solution or long-run behavior depends to a large degree on these and other algorithmic choices, such as the hypergradient approximation. We also show that the inner solutions obtained by warm-start BLO can encode a surprising amount of information about the outer objective, even when the outer parameters are low-dimensional. We believe that implicit bias deserves as central a role in the study of bilevel optimization as it has attained in the study of single-level neural net optimization.
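The dependence of the inner solution on cold-start versus warm-start can be seen in a toy sketch. This is an illustrative assumption, not an experiment from the paper: the inner problem is an overparameterized least-squares fit with infinitely many minimizers, and the outer parameters follow a fixed hypothetical schedule instead of being updated by hypergradients. All names and values are hypothetical.

```python
def inner_loss(w, u):
    """Overparameterized inner objective: any w with u . w = 1 is a minimizer."""
    residual = u[0] * w[0] + u[1] * w[1] - 1.0
    return 0.5 * residual * residual


def solve_inner(w_init, u, lr=0.1, steps=2000):
    """Plain gradient descent on the inner loss, starting from w_init."""
    w = list(w_init)
    for _ in range(steps):
        residual = u[0] * w[0] + u[1] * w[1] - 1.0
        w[0] -= lr * residual * u[0]
        w[1] -= lr * residual * u[1]
    return w


# Hypothetical sequence of outer parameters (stands in for outer updates).
outer_schedule = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Cold-start: the inner variables are re-initialized at every outer step.
cold = [solve_inner([0.0, 0.0], u) for u in outer_schedule][-1]

# Warm-start: the inner variables carry over from the previous outer step.
warm = [0.0, 0.0]
for u in outer_schedule:
    warm = solve_inner(warm, u)

print(cold, warm)
```

Both final solutions minimize the inner loss for the last outer parameter, but the warm-start solution also retains a component inherited from earlier outer parameters, which is one simple way an inner solution can carry information about the outer trajectory.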